About the project


The course seems really interesting. It’s cool that we have an enthusiastic teacher. I wonder if many researchers in medicine use R markdown?

Took my first steps with R Markdown and GitHub today. I haven’t managed to change the theme of my GitHub Pages site yet; that’s my top priority, so I’m working on it.

Link to my repository.


Chapter 2: Regression and model validation exercise

On Tuesday I did the data wrangling exercises, and on Wednesday the data analysis tasks. I have some prior experience with dplyr, so the wrangling part was not too challenging, but it was a good refresher. For example, I realised that in dplyr it is not straightforward to compute “rowMeans”-style summaries over only some of the columns - the easiest way was to first select the relevant part of the dataset.

Here begins the code.

Setup: setting the working directory and opening the data

## change the working directory of this R Markdown file
## (note: root.dir is set via opts_knit, not opts_chunk)
knitr::opts_knit$set(root.dir = "/Users/pecsi_max/Desktop/GitHub/IODS-project/data/")
setwd("/Users/pecsi_max/Desktop/GitHub/IODS-project/data/")
getwd()

#load the required packages
library(ggplot2)
library(GGally)
library(dplyr)

Opening the data. “tbl_df” converts the data frame to a tibble. To my knowledge, this is recommended when working with dplyr, although in these exercises it makes little difference.

#read the data.  

learn <- tbl_df(read.csv("/Users/pecsi_max/Desktop/GitHub/IODS-project/data/learning2014.csv"))

#structure of data
str(learn)
## Classes 'tbl_df', 'tbl' and 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
#dimensions of data
dim(learn)
## [1] 166   7

About the data

“learn” is a dataset of 166 students, with the variables “gender”, “age”, “attitude”, “points”, “deep”, “stra” and “surf”. The last three are scales measuring different approaches to learning (deep, strategic and surface learning). Students who scored 0 points on the exam are not included.

Exploring the data

#create a plot matrix with all the variables
pairs_plot <- ggpairs(learn, mapping = aes(alpha = 0.5), 
                      lower = list(combo = wrap("facethist", bins = 20),
                                   continuous = wrap("points", size=0.1)),
                      upper = list(continuous = wrap("cor", alpha = 1)))
pairs_plot

Comments on preliminary associations

  • Females outnumber males by almost 2:1, and most students are in their twenties
  • Attitude correlates strongly with points
  • Females perhaps have better attitudes than males

Different aspects of learning and points:
  • Strategic learning is slightly positively correlated with points
  • Deep learning is not correlated with points
  • Surface learning is negatively correlated with points
  • The correlations of attitude with these learning approaches are similar

Linear regression model

Model 1

#construct a model with 3 explanatory variables: attitude, surface learning, strategic learning
model1 <- lm(points ~ attitude + stra + surf, data = learn)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
Comments on model 1

There is a statistically significant linear association (adjusted R-squared = 0.193). Only attitude appears to be statistically significantly associated with exam points. Let’s change the model…

Model 2

# testing to put all variables in the model 
model_all <- lm(points ~ attitude + gender + age + deep + stra + surf, data = learn)
summary(model_all)
## 
## Call:
## lm(formula = points ~ attitude + gender + age + deep + stra + 
##     surf, data = learn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4506  -3.2479   0.3879   3.5243  11.4820 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.30246    5.25723   3.481 0.000644 ***
## attitude     0.34430    0.05978   5.760 4.24e-08 ***
## genderM     -0.05858    0.92985  -0.063 0.949845    
## age         -0.09607    0.05396  -1.780 0.076929 .  
## deep        -1.04279    0.78438  -1.329 0.185606    
## stra         0.95871    0.55150   1.738 0.084083 .  
## surf        -1.10891    0.84467  -1.313 0.191132    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared:  0.2312, Adjusted R-squared:  0.2021 
## F-statistic: 7.967 on 6 and 159 DF,  p-value: 1.599e-07

In addition to attitude, age and stra are perhaps the best predictors, even though their p-values are above 0.05.

Final model

Create a model with the three best explanatory variables:

model_final <- lm(points ~ attitude + age + stra, data = learn)
summary(model_final)
## 
## Call:
## lm(formula = points ~ attitude + age + stra, data = learn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     0.34808    0.05622   6.191 4.72e-09 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
Comments on the final model

Adjusted R-squared = 0.2037, marginally better than in model 1. An adjusted R-squared of 0.2037 means that the model explains 20.4% of the variance observed in exam points. “Adjusted” means that the value is penalised for the number of variables in the model.

Only attitude is statistically significantly associated with points: an increase of 1 in attitude is associated with a 0.35-point increase in exam points. A one-year increase in age is associated with a 0.09-point decrease, but that relationship is not statistically significant. Strategic learning is “borderline significant”, perhaps associated with slightly better points. However, it should be stressed that attitude is the only variable clearly associated with exam scores.
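As a sanity check, the adjusted value can be reproduced by hand from the multiple R-squared reported by summary() (a small sketch; the numbers below are copied from the output above):

```r
# Adjusted R-squared penalises R-squared for the number of predictors:
# adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.2182   # multiple R-squared from summary(model_final)
n  <- 166      # number of observations
p  <- 3        # predictors: attitude, age, stra
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.2037, matching the summary output
```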

Visualization of the model diagnostics

plot(model_final, which = c(1, 2, 5))



From the diagnostic plots, we can see the following:

  • In ‘Residuals vs Fitted’, the mean of the residuals (i.e. the differences between predicted and observed values of ‘points’) is close to zero across all fitted values (x-axis). There is no linear or non-linear pattern in the residuals. Some outliers (cases 56, 145, 35) are flagged.
  • In the ‘Normal Q-Q’ plot, the residuals fall nicely along the line, so the normality assumption looks reasonable. The same outliers are flagged.
  • In ‘Residuals vs Leverage’, all cases are well within Cook’s distance, which means that even the outliers follow the pattern of the other data points and should not change the model too much. In fact, the Cook’s distance contour lines fall outside the plot’s range.
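The leverage check can also be done numerically instead of visually. A self-contained sketch on a built-in dataset (the ‘learn’ data is not reproduced here, so mtcars stands in; the model below is purely illustrative):

```r
# Extract Cook's distances from a fitted lm and inspect the extremes
toy_model <- lm(mpg ~ wt + hp, data = mtcars)
cooks_d <- cooks.distance(toy_model)
which.max(cooks_d)               # the single most influential observation
sum(cooks_d > 4 / nrow(mtcars))  # cases above a common rule-of-thumb cutoff
```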

Chapter 3: Logistic regression exercise

#create a directory path where the data is located
path <- paste(getwd(), "data", "alcohol", sep = "/") 

#create the data
alc <- read.csv(path)
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob       <fctr> teacher, other, other, services, other, other, oth...
## $ reason     <fctr> course, course, other, home, home, reputation, hom...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

Data

‘alc’ is a dataset of students in two Portuguese schools, covering their performance in two subjects, mathematics and Portuguese language. It contains variables on the students’ education, family, school performance, and alcohol use, which is of special interest in this report.
For exploration, let’s create two datasets, alc_col and alc_other: alc_col contains the factor variables, alc_other the supposedly continuous ones. As the “failures” variable seems to have only 4 levels, let’s add it to alc_col as well.

#load the packages needed below and index which variables are factors
library(purrr)  #for map_lgl
library(tidyr)  #for gather, used below
alc_colinds <- which(map_lgl(alc, is.factor))
alc_col <- dplyr::select(alc, alc_colinds, failures, high_use)
alc_other <- dplyr::select(alc, -alc_colinds, -failures, high_use)


#exploratory plots

alc_col %>% gather(key, value, -high_use) %>%
  ggplot(., aes(x=value, fill = high_use)) + geom_bar() +
   facet_wrap("key", scales = "free", shrink = TRUE)
## Warning: attributes are not identical across measure variables; they will
## be dropped

alc_other %>% gather(key, value, -high_use) %>%
  ggplot(., aes(x=high_use, y=value)) + geom_boxplot() + 
  facet_wrap("key", scales = "free", shrink = T)


Interesting variables

We select four interesting variables for the analysis: famrel, sex, absences and goout.

  • famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). Bad family relations probably relate to higher alcohol use.
  • sex - boys probably drink more than girls.
  • absences - number of school absences. Probably high if the student drinks a lot.
  • goout - going out with friends (numeric: from 1 - very low to 5 - very high). Going out a lot probably relates to higher alcohol use.



Let’s do some exploratory analyses with these variables - we look at possible differences between the high and low alcohol consumption groups.

# create a dataset with the variables of interest
 alc_log <- alc %>% 
  dplyr::select(famrel, sex, absences, goout, high_use) 
  
#exploratory table (CreateTableOne comes from the 'tableone' package)
library(tableone)
CreateTableOne(data = alc_log, strata = "high_use", vars = c("famrel", "sex", "absences", "goout"))
##                       Stratified by high_use
##                        FALSE        TRUE         p      test
##   n                     268          114                    
##   famrel (mean (sd))   4.00 (0.89)  3.78 (0.93)   0.027     
##   sex = M (%)           112 (41.8)    72 (63.2)  <0.001     
##   absences (mean (sd)) 3.71 (4.45)  6.37 (6.99)  <0.001     
##   goout (mean (sd))    2.85 (1.02)  3.72 (1.10)  <0.001


For all of our variables there appears to be a statistically significant difference between the groups: the proportion of males differs, and the means of the continuous variables differ between high and non-high drinkers.

The logistic regression model

modbin <- glm(data = alc_log, high_use ~ famrel + sex + absences + goout, family = "binomial") 

#look at the model
summary(modbin)
## 
## Call:
## glm(formula = high_use ~ famrel + sex + absences + goout, family = "binomial", 
##     data = alc_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7151  -0.7820  -0.5137   0.7537   2.5463  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.76826    0.66170  -4.184 2.87e-05 ***
## famrel      -0.39378    0.14035  -2.806 0.005020 ** 
## sexM         1.01234    0.25895   3.909 9.25e-05 ***
## absences     0.08168    0.02200   3.713 0.000205 ***
## goout        0.76761    0.12316   6.232 4.59e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 379.81  on 377  degrees of freedom
## AIC: 389.81
## 
## Number of Fisher Scoring iterations: 4
#calculate the odds ratios with confidence intervals
modbin_odds <- exp(cbind(OR = coef(modbin), confint(modbin)))
## Waiting for profiling to be done...
modbin_odds
##                     OR      2.5 %    97.5 %
## (Intercept) 0.06277092 0.01648406 0.2225470
## famrel      0.67449969 0.51060362 0.8869199
## sexM        2.75203774 1.66768032 4.6128191
## absences    1.08510512 1.04012840 1.1353940
## goout       2.15461275 1.70345866 2.7639914


All the variables are significantly associated with high alcohol use. Family relations are inversely related: the better the family relations, the lower the odds of heavy drinking. Being male multiplies the odds by about 2.8. Absences and going out are related to higher odds of high alcohol use.
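The link between the coefficient table and the OR table is just exponentiation: a logistic regression coefficient is a log-odds ratio. A quick check with the sexM estimate copied from the summary above:

```r
# exp() turns a log-odds coefficient into an odds ratio
beta_sexM <- 1.01234       # coefficient for sexM from summary(modbin)
or_sexM <- exp(beta_sexM)
round(or_sexM, 2)          # 2.75, matching the OR table
```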

Prediction

#let's add the probabilities to data frame
alc_pred <- alc %>% 
  mutate(prob = predict(modbin, type = "response")) %>%
  mutate(pred = prob>0.5) 
## Warning: package 'bindrcpp' was built under R version 3.2.5
#table 
obs_table <- table(high_use = alc_pred$high_use, prediction = alc_pred$pred) 
prob_table <- obs_table %>% prop.table %>% addmargins

obs_table
##         prediction
## high_use FALSE TRUE
##    FALSE   254   14
##    TRUE     64   50
prob_table
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66492147 0.03664921 0.70157068
##    TRUE  0.16753927 0.13089005 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000


In ‘obs_table’, the number of true positives is 50 and false positives 14; true negatives are 254 and false negatives 64. In ‘prob_table’, 66% were correctly classified as non-high drinkers, 17% were wrongly classified as non-high drinkers, 4% were wrongly classified as high drinkers, and 13% were correctly assigned high-drinker status.
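From the same counts we can derive the usual summary metrics (a sketch using the numbers copied from obs_table):

```r
# tn/fp/fn/tp copied from obs_table above
tn <- 254; fp <- 14; fn <- 64; tp <- 50
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # share of high drinkers caught
specificity <- tn / (tn + fp)                   # share of non-high drinkers kept
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The model is much better at ruling heavy drinking out (specificity around 0.95) than at catching it (sensitivity around 0.44).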

#visualise the relation
ggplot(alc_pred, aes(x = prob, y = high_use, col = pred)) + geom_point()


We see that among the high drinkers, the number of false predictions is rather high.

Proportion of false predictions

#define the loss function: the proportion of predictions that fall on the
#wrong side of the 0.5 threshold
loss_func <- function(class = alc_pred$high_use, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# apply the loss function to different prediction strategies
# probability of high drinking is 0 in all
loss_func(prob=0)
## [1] 0.2984293
# probability of high drinking is 1 in all
loss_func(prob=1)
## [1] 0.7015707
loss_func(prob = alc_pred$prob)
## [1] 0.2041885

If we simply guess that no student is a high drinker, the prediction error is around 30%. If we guess that everyone is a high drinker, the error is 70%. With our model, the training error is “only” about 20%.
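That ~30% baseline is no accident: predicting FALSE for everyone errs exactly on the high drinkers, so the error equals their share of the data (counts copied from obs_table):

```r
# 114 high drinkers out of 382 students
round(114 / 382, 4)  # 0.2984, the same value loss_func(prob = 0) returned
```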

Cross-validation

library(boot)
## Warning: package 'boot' was built under R version 3.2.5
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
cv <- cv.glm(data = alc_pred, cost = loss_func, glmfit = modbin, K = 10)
cv$delta[1]
## [1] 0.2198953


The cross-validation analysis is done with the function ‘cv.glm’. K is the number of folds we split the data into. The first element of cv$delta is the prediction error (the second element would be a bias-adjusted value). So, with 10-fold cross-validation, our model has about a 22% prediction error, which is better than the approx. 26% error in the DataCamp exercise!
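To see the moving parts of cv.glm in isolation, here is a minimal self-contained sketch on the built-in mtcars data (the model and variable names are illustrative, not part of the exercise):

```r
library(boot)

# Predict the binary transmission variable 'am' from car weight
toy_fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Same loss as above: share of predictions on the wrong side of 0.5
toy_loss <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(1)
toy_cv <- cv.glm(data = mtcars, cost = toy_loss, glmfit = toy_fit, K = 5)
toy_cv$delta[1]  # 5-fold cross-validated prediction error
```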

SUPER-BONUS

library(boot)
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"
#2nd model
modbin2 <- glm(data=alc, high_use ~ sex + school + age + address + famsize + Pstatus + reason + internet + failures + schoolsup +
                  activities + romantic + freetime + goout + absences + G3, family = "binomial")


prob2 <- predict(modbin2, type = "response")
train2 <- loss_func(prob=prob2)

cv2 <- cv.glm(data=alc_pred, cost = loss_func, glmfit = modbin2, K = 10)

#3rd model
modbin3 <- glm(data=alc, high_use ~ sex + school + age  + failures + schoolsup +
                  activities + romantic + freetime + goout + absences + G3, family = "binomial")


prob3 <- predict(modbin3, type = "response")
train3 <- loss_func(prob=prob3)

cv3 <- cv.glm(data=alc_pred, cost = loss_func, glmfit = modbin3, K = 10)

#4th model
modbin4 <- glm(data=alc, high_use ~ sex + school + schoolsup + reason + activities + goout + freetime + absences, family = "binomial")

prob4 <- predict(modbin4, type = "response")
train4 <- loss_func(prob=prob4)

cv4 <- cv.glm(data=alc_pred, cost = loss_func, glmfit = modbin4, K = 10)


#5th model
modbin5 <- glm(data=alc, high_use ~ sex  + activities + goout + absences, family = "binomial")

prob5 <- predict(modbin5, type = "response")
train5 <- loss_func(prob=prob5)

cv5 <- cv.glm(data=alc_pred, cost = loss_func, glmfit = modbin5, K = 10)


#now create a vector of testing errors and training errors
cv_vector <- c(cv2$delta[1], cv3$delta[1], cv4$delta[1], cv5$delta[1])
train_vector <- c(train2, train3, train4, train5)

length_vector <- c(16, 11, 8, 4)


plot(length_vector, cv_vector, xlab = "number of predictors", ylab = "testing error (10-fold CV)")

plot(length_vector, train_vector, xlab = "number of predictors", ylab = "training error")

The testing error (calculated with 10-fold cross-validation) appears to increase as the number of variables goes up. The training error does not necessarily follow the same trend: with 4 predictors, the training error appears to be greater.

Thanks for reading!


Chapter 4 - clustering and classification

About the data

We use the ‘Boston’ dataset from the MASS package. It contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts; more info here

#load the data & first glimpses
library(MASS)
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

The data contains 506 rows and 14 columns, i.e. variables. All variables are numeric (type ‘dbl’) except ‘chas’ and ‘rad’, which are integers and could be treated as categorical.

#Visualize the histograms
Boston %>% gather(key, value) %>%
  ggplot(., aes(x=value)) + geom_histogram(bins = 30) + 
  facet_wrap("key", scales = "free", shrink = T)

#visualize the correlations. First, calculate the p-values for the correlations
#(cor.mtest comes from the 'corrplot' package)
library(corrplot)
p.mat <- cor.mtest(Boston, method = "spearman", exact = FALSE)

#then, calculate and plot correlations. The function leaves out insignificant correlations
cor(Boston, method = "spearman") %>% corrplot(method = "circle", 
                         type = "upper", 
                         tl.cex = 1.3, cl.pos = "b",
                         p.mat = p.mat$p, sig.level = 0.05, insig = "blank",
                         diag = FALSE)


In many variables the distribution is rather skewed, or a single value is heavily represented. “chas” is indeed a categorical variable with levels 0 and 1.

Concerning the correlations, many variables appear to be correlated. For example, crime appears to be most strongly (inversely) correlated with the distance to employment centers. Note that the circles represent Spearman’s correlation coefficients. “chas” is a binary variable, so correlations with it are not really meaningful.
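Spearman’s coefficient is simply Pearson’s correlation computed on ranks, which is why it copes with the skewness seen in the histograms. A quick self-contained check:

```r
# A monotone but non-linear relationship
x <- c(1, 2, 4, 8, 16, 32)
y <- x^2 + 1
cor(x, y, method = "spearman")  # 1: the relationship is perfectly monotone
cor(rank(x), rank(y))           # identical by definition
```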

Editing the data

Let’s edit the data so that we can do some clustering and classification.

#scale Boston and turn it into a data frame: each variable gets mean 0, and values are expressed in standard deviations from the mean
scaled_b <- Boston %>% scale %>% as.data.frame
summary(scaled_b)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
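scale() does nothing more mysterious than subtracting each column’s mean and dividing by its standard deviation; a self-contained check on one column (reloading Boston so the chunk stands alone):

```r
library(MASS)
data("Boston")

# Manual standardisation of one column vs. scale()
z_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
z_scale  <- as.vector(scale(Boston$crim))
all.equal(z_manual, z_scale)  # TRUE
```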
#create a factor of crime rate. With "mutate(crim = ...)" we override the old values of 'crim', so we don't need to drop any old variables
bins <- quantile(scaled_b$crim)
scaled_b <- scaled_b %>% mutate(crim = cut(scaled_b$crim, 
                               breaks = bins, 
                               include.lowest = TRUE, 
                               labels = c("low", "med_low", "med_high", "high")))


# Divide the dataset into 'train' and 'test' sets, 80% and 20%
n <- nrow(scaled_b)
ind <- sample(n, size = 0.8*n)

train <- scaled_b[ind,]
test <- scaled_b[-ind,]

Linear discriminant analysis

lda_boss <- lda(crim ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#numeric classes of crime rates
classes <- as.numeric(train$crim)

# plot the lda results
plot(lda_boss, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_boss, myscale = 2)

#test predicted and observed classes
observed <- test$crim
predicted <- predict(lda_boss, newdata = test)

crosstab <- table(correct = observed, predicted = predicted$class)
crosstab
##           predicted
## correct    low med_low med_high high
##   low       14       7        5    0
##   med_low    3      18        5    0
##   med_high   0      12       15    0
##   high       0       0        0   23


“High” crime rates appear to be correctly predicted. In the other classes, around 50% of the predictions are correct, but most wrong predictions fall into adjacent crime-rate classes (e.g. correct class med_low, predicted class med_high).
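The overall accuracy can be read off the diagonal of the cross-tabulation (counts copied from the table above; note that the exact numbers vary between runs, since the train/test split is random):

```r
# Rows: correct class, columns: predicted class (from crosstab above)
ct <- matrix(c(14,  7,  5,  0,
                3, 18,  5,  0,
                0, 12, 15,  0,
                0,  0,  0, 23), nrow = 4, byrow = TRUE)
sum(diag(ct)) / sum(ct)  # about 0.69 of the test cases classified correctly
```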

K-means

Let’s reload the data for clustering.

#reload & scale Boston
data('Boston') 
boston_scaled <- Boston %>% scale %>% as.data.frame 

#euclidean distances of Boston
dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
#how many centroids should we have?
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')


The optimal number of clusters could be 2, because the steepest drop in the total within-cluster sum of squares ends there. Let’s however select 3; the sum of squared distances still drops nicely after that.
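The elbow logic rests on the fact that the total within-cluster sum of squares (twcss) always decreases as k grows, so we look for the point where the decrease levels off. A self-contained sketch on the built-in iris data (illustrative, not part of the exercise):

```r
set.seed(123)
# twcss for k = 1..6 on the scaled numeric columns of iris
tw <- sapply(1:6, function(k) {
  kmeans(scale(iris[, 1:4]), centers = k, nstart = 25)$tot.withinss
})
all(diff(tw) < 0)  # strictly decreasing; the elbow is where the drops get small
```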

# run the K-means cluster analysis again with 3 centroids and plot the results
kmc <- kmeans(boston_scaled, centers = 3)
Boston$cluster <- as.factor(kmc$cluster)
pairs(Boston[5:13], col = Boston$cluster, pch = 19, cex = 0.2)

Boston %>% gather(key, value, -cluster) %>%
  ggplot(., aes(x = key, y = value, group = cluster, col = cluster)) + geom_boxplot() + 
  facet_wrap("key", scales = "free", shrink = TRUE)

The first figure shows the variable pairs as scatter plots. The biggest differences between the 3 clusters are in age, crime rate, distance to employment centers, proportion of lower-status residents (lstat), value of homes (medv), nitrogen oxide concentration (nox), pupil-teacher ratio (ptratio), and so on.

BONUS

set.seed(123)
# Create another K-means clustering with 5 centroids; reload and rescale the data first
data("Boston")
boston_scaled <- Boston %>% scale %>% as.data.frame
kmc_bonus <- kmeans(boston_scaled, centers = 5)
boston_scaled$cluster <- kmc_bonus$cluster
lda_bonus <- lda(cluster ~ ., data = boston_scaled)

# visualisation, using the same functions as in the previous sections. "col" and "pch" are set to boston_scaled$cluster
plot(lda_bonus, dimen = 2, col = boston_scaled$cluster, pch = boston_scaled$cluster)
lda.arrows(lda_bonus, myscale = 2.5)

A couple of variables stand out in these clusters: black and crim seem to contribute as individual variables. It should be noted that the results varied considerably between runs before the “set.seed(123)” line was added.

SUPER BONUS

library(plotly)

#copy-pasting the code
model_predictors <- dplyr::select(train, -crim)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda_boss$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_boss$scaling
matrix_product <- as.data.frame(matrix_product)

# k-means clustering for colors
kmc_superbonus <- kmeans(model_predictors, centers = 3)
model_predictors$cluster <- kmc_superbonus$cluster

# 3D plots
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = train$crim)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, color = model_predictors$cluster)